Serving and consuming an HTTP multipart/mixed response in Python #33

Merged
ianmcook merged 12 commits into apache:main from felipecrv:multipart
Aug 29, 2024
@felipecrv felipecrv commented Aug 29, 2024

Contributor

The client parses the multipart response produced by server/server.py
using the multipart message parser from the Python email module.

This module holds the entire message in memory and seems to spend a lot
of time looking for part delimiters and encoding/decoding the parts.

On my machine, multipart/mixed parsing accounts for about 85% of the
total execution time. Once the ~1GB Arrow stream message is fully in
memory, parsing it takes only 0.06% of the total execution time.

$ python simple_client.py
-- 3731 bytes of JSON data:
[
  {'ticker': 'SGJ', 'description': 'Syhnffek Gacb Jdylqis'}
  {'ticker': 'EILD', 'description': 'Eicfef Iiafeutm Lydut Dbmgq'}
  {'ticker': 'QTO', 'description': 'Qclxkqjd Tkxan Odmac'}
  {'ticker': 'IHTS', 'description': 'Iowjy Hieuj Tvwecy Smxedh'}
  {'ticker': 'TGFJ', 'description': 'Tvztlhba Garebomj Fnwvwgf Jffldbg'}
  ...+55 entries...
]
-- 988931832 bytes of Arrow data:
Schema:
ticker: string
price: int64
volume: int64

Parsed 42000000 records in 6836 batch(es)
-- Text Message:
Hello Client,

6836 Arrow batch(es) were sent in 6.561 seconds through 6837 HTTP
response chunks. Average size of each chunk was 144644.13 bytes.

--
Sincerely,
The Server
-- End of Text Message --
13.645 seconds elapsed
11.833 seconds (86.72%) seconds parsing multipart/mixed response
0.011 seconds (0.08%) seconds parsing Arrow stream
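
The parsing approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in simple_client.py: the names `build_response`, `parse_response`, and `BOUNDARY` are hypothetical, the payloads are small ASCII stand-ins (including the one labelled as an Arrow stream), and the 7-byte chunks only imitate HTTP response chunks. It relies on `email.parser.BytesFeedParser` from the standard library, which needs the `Content-Type` header fed in first so it can learn the boundary.

```python
from email.parser import BytesFeedParser

# Hypothetical boundary string; a real server would pick its own.
BOUNDARY = "example-boundary"


def build_response(parts):
    """Build (Content-Type header value, body bytes) from (type, payload) pairs."""
    delimiter = b"--" + BOUNDARY.encode("ascii")
    pieces = []
    for content_type, payload in parts:
        pieces.append(delimiter + b"\r\n")
        pieces.append(b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n")
        pieces.append(payload + b"\r\n")
    pieces.append(delimiter + b"--\r\n")  # closing delimiter
    return f'multipart/mixed; boundary="{BOUNDARY}"', b"".join(pieces)


def parse_response(content_type, body_chunks):
    """Feed the header and body chunks to the email parser and list the parts."""
    parser = BytesFeedParser()
    # The parser expects a full message, so the HTTP Content-Type header goes first.
    parser.feed(b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n")
    for chunk in body_chunks:
        parser.feed(chunk)
    message = parser.close()  # the whole message is held in memory at this point
    return [
        (part.get_content_type(), part.get_payload(decode=True))
        for part in message.get_payload()
    ]


# Stand-in payloads; the real response carries JSON, Arrow stream bytes, and text.
example_parts = [
    ("application/json", b'[{"ticker": "SGJ"}]'),
    ("application/vnd.apache.arrow.stream", b"ARROW-STREAM-BYTES"),
    ("text/plain", b"Hello Client,"),
]
header_value, body = build_response(example_parts)
chunks = [body[i:i + 7] for i in range(0, len(body), 7)]

for content_type, payload in parse_response(header_value, chunks):
    print(f"-- {len(payload)} bytes of {content_type}")
```

Because the parser only hands back parts after `close()`, this style cannot process a part before the full response has arrived, which is consistent with the memory behaviour noted above.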

Closes apache/arrow#40598

@ianmcook

Comment thread http/get_simple/python/server/README.md
Comment thread http/get_multipart/python/client/simple_client.py Outdated
@ianmcook

Member

Could you please add a small README.md file alongside server.py in the server subdir that briefly explains what the server does?

(similar to https://github.com/apache/arrow-experiments/blob/main/http/get_simple/python/server/README.md)

Comment thread http/get_multipart/python/server/server.py Outdated
@felipecrv felipecrv requested a review from ianmcook August 29, 2024 15:11
@ianmcook

Member

Can you use carets in the markdown footnotes (like this) so GitHub renders them as footnotes? Thanks

Comment thread http/get_multipart/python/client/simple_client.py Outdated
Comment thread http/get_multipart/python/client/README.md Outdated
@felipecrv

Contributor Author

> Can you use carets in the markdown footnotes (like this) so GitHub renders them as footnotes? Thanks

Done. I was going to look up the syntax after seeing the bad results.

@felipecrv felipecrv requested a review from ianmcook August 29, 2024 16:36
@ianmcook

Member

Thanks @felipecrv, this looks great! The only problem I see here is that the calls to feedparser.feed() in the client example are excruciatingly slow, but you've explained clearly that this is an incidental effect of using the Python email module. Maybe later (with lower priority) we can come back and develop a more performant example.

@ianmcook

Member

I will merge later today if there are no other comments

@felipecrv

Contributor Author

> Thanks @felipecrv, this looks great! The only problem I see here is that the calls to feedparser.feed() in the client example are excruciatingly slow, but you've explained clearly that this is an incidental effect of using the Python email module. Maybe later (with lower priority) we can come back and develop a more performant example.

Yeah. It's the parsing logic. Passing the entire 1GB message blob to email.message_from_bytes() is even slower, and that is without counting the time it takes to build the buffer.

I called this simple_client.py because later we should add a streaming_client.py.
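
For anyone who wants to reproduce the comparison between incremental feeding and `email.message_from_bytes()`, a rough timing sketch could look like the one below. Everything in it is an assumption made for illustration: the helper names (`make_message`, `parse_incrementally`, `parse_at_once`, `timed`), the boundary, the 64 KiB chunk size, and the small synthetic payload, which is far smaller than the ~1GB response discussed here. Timings will vary by machine and payload, so no expected numbers are given.

```python
import email
import time
from email.parser import BytesFeedParser

# Hypothetical synthetic payload, a few hundred KB of short lines.
PAYLOAD = b"0123456789abcdef\n" * 20000


def make_message(payload, boundary="example-boundary"):
    """Wrap one payload in a single-part multipart/mixed message, headers included."""
    delimiter = b"--" + boundary.encode("ascii")
    return b"".join([
        f'Content-Type: multipart/mixed; boundary="{boundary}"\r\n\r\n'.encode("ascii"),
        delimiter + b"\r\n",
        b"Content-Type: application/octet-stream\r\n\r\n",
        payload + b"\r\n",
        delimiter + b"--\r\n",
    ])


def parse_incrementally(raw, chunk_size=64 * 1024):
    """Feed the message to BytesFeedParser one chunk at a time."""
    parser = BytesFeedParser()
    for start in range(0, len(raw), chunk_size):
        parser.feed(raw[start:start + chunk_size])
    return parser.close()


def parse_at_once(raw):
    """Hand the complete buffer to email.message_from_bytes()."""
    return email.message_from_bytes(raw)


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


raw_message = make_message(PAYLOAD)
fed_message, fed_seconds = timed(parse_incrementally, raw_message)
whole_message, whole_seconds = timed(parse_at_once, raw_message)

print(f"{fed_seconds:.3f} seconds feeding chunks to BytesFeedParser")
print(f"{whole_seconds:.3f} seconds in email.message_from_bytes()")
```

Both paths end with the full message in memory; the sketch only compares how the bytes reach the parser.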

Development

Successfully merging this pull request may close these issues.

[Python] Create Python examples of HTTP GET Arrow client/server supporting multipart/mixed response